FOREWORD


Since the AFOSR Advanced Methodologies program began in 1970,
the experimental techniques described in its reports have been
applied by a number of investigators to their own problems.
Although some of the resultant experiments have had serious
methodological deficiencies, the majority were published only as
organizational reports with limited distribution. The series of
papers reviewed in this report, however, was published in a
leading human factors journal and read by many investigators in
the field. The experiments in the series were presented as
reasonable examples of how the response surface methodology
should be used.


Unfortunately, they were not. 


That series may well represent the only exposure many
investigators will get to this new and important approach to
psychological research. The experiments in the series have
already been used as models from which other investigators have
designed and conducted their own experiments. As a consequence,
the methodological weaknesses that do exist in the series are
being proliferated. This report is written to alert potential
users of central-composite designs and response surface
methodology to those weaknesses that affect both application and
interpretation, and to offer constructive guidance. The
distinction between using an experimental design, that is, a
pattern of data-collection points, and employing an experimental
strategy is emphasized.

Charles W. Simon 
INTRODUCTION


A series of five articles was published in a special edition
of Human Factors, August 1973, purporting to explain and
illustrate the characteristics and applications of
central-composite designs (CCDs) in the context of
response-surface methodology (RSM). In the first article, by
Clark and Williges (1973), the approach developed by G. E. P. Box
and associates is described along with some "design
modifications" proposed by the authors. In the remaining four
articles, by Williges and Baron, North, or Mills, experiments are
described that attempt to illustrate how RSM-CCDs should be used,
to examine empirically the effects of the "design modifications,"
and to evaluate the effectiveness of CCDs.¹


¹Six particular papers will be referred to a great many times in
this paper. To minimize the effect of this intrusion into the
text, a special notation will be employed. Two letters designating
the two authors' names will be given, thus: Clark and Williges
(CW), Williges and Baron (WB), Williges and North (WN), Mills and
Williges (MW), and Williges and Mills (WM), all in a series of
papers in Human Factors, 1973. The same will be given for the 1958
paper by Box and Hunter (BH). The author notation will be
followed, if necessary, by the page number, and then by the number
of the paragraph (counting any incomplete paragraph at the
beginning of the page) in which the reference is to be found. When
no paragraph number is present, the reference is to a figure or
table on the designated page or the entire page. Occasionally a
specific location, e.g., "summary" or "footnote," is substituted
for the paragraph number. Thus, for example, (BH169,1) refers to
the first paragraph on page 169 in the paper by Box and Hunter
(1958).

The series is important because it succeeded in arousing
among human factors investigators considerable interest in this
powerful experimental methodology. Since these articles are
currently being used as model examples of how to apply this
methodology, a critical examination and evaluation of the series
is in order. The series, considered collectively, will be
reviewed here to show where and why the experimental papers:
1. Fail to apply, or improperly apply, the most important and
   useful features of "response surface methodology" as
   proposed by G. E. P. Box and his associates.
2. Employ procedures, not specific to RSM, that permit
   interpretations of the results not considered by the
   investigators.
3. Do not constitute an experimental evaluation of the
   effectiveness of RSM central-composite designs as
   suggested by the investigators.
Each of these statements will be supported in considerable detail
in the major sections that follow this brief introduction to RSM
and CCD.


RSM and CCD

Since Box and Wilson's (1951) original article, an extensive
literature has evolved on the development and applications of
response surface designs (Hill and Hunter, 1966).
The effectiveness of these designs in chemical research is well
established. The term "response surface," as used here, refers
to the estimated responses at points throughout the multivariate
space expressed in the form of an approximating polynomial. For
two or three variables, the surface can be represented by a contour
map. Response surfaces can be derived from any experimental plan
when the collected data are analyzed using a regression model, and
as such are not unique.


"Response surface methodology," on the other hand, is the
particular approach proposed by Box and his associates that
includes a viable research philosophy, an economical data point
pattern, a flexible data collection strategy, and an iterative
data collection and analysis process among its major contributions.
The "central-composite design" (CCD) referred to in the Williges
articles is one of a number of "response surface designs" in which
the coordinates of the data collection points satisfy the
characteristics specified by the methodology. The coordinates of
the complete CCD form the geometric patterns of a hypercube design
combined with a hyperstar design (a measure polytope) and a number
of center points. The geometric configuration for a completed CCD
for three independent variables is shown in Figure 1.
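
The geometry just described can be written out as coded coordinates. The sketch below is only an illustration: the function name `central_composite` is hypothetical, and the default star distance (the fourth root of the number of cube points, the usual rotatable choice for a full-factorial cube) is an assumption, since the text does not fix a value at this point.

```python
from itertools import product


def central_composite(k, alpha=None, n_center=1):
    """Return (cube, star, center) coded points of a full CCD in k variables.

    alpha defaults to (2**k) ** 0.25, the usual rotatable value for a
    full-factorial cube (an assumption, not a value taken from the text).
    """
    if alpha is None:
        alpha = (2 ** k) ** 0.25
    # Hypercube portion: every combination of -1 and +1.
    cube = [tuple(float(v) for v in pt) for pt in product((-1, 1), repeat=k)]
    # Star portion: two points on each axis at distance alpha from the center.
    star = []
    for axis in range(k):
        for sign in (-1.0, 1.0):
            pt = [0.0] * k
            pt[axis] = sign * alpha
            star.append(tuple(pt))
    # Center portion: one or more points at the origin.
    center = [tuple([0.0] * k) for _ in range(n_center)]
    return cube, star, center


cube, star, center = central_composite(3, n_center=1)
levels = sorted({pt[0] for pt in cube + star + center})
print(len(cube), len(star), len(center), levels)
```

For three variables this gives eight cube points, six star points, and the center, with each variable taking five distinct coded levels.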


Other response surface designs have been developed from such
spatial arrangements as pentagons, hexagons, incomplete factorial
blocks, dodecahedrons, noncentrally-arranged hypercubes and polytopes,
tetrahedrons plus octahedrons, as well as sets of hyperspheres (Box
and Hunter, 1953; DeBaun, 1959; Myers, 1971).


Response surface designs such as the CCD are available for
estimating first or second order surfaces; others are capable of
estimating third order surfaces. Some designs require an equal
number of levels for each variable; others have been developed for
handling variables at two and three levels and at two and four
levels. All of the designs emphasize economy in data collection.
Figure 1. Coordinates of data points in the central-composite
A summary of some of the more useful designs for multifactor
research in engineering psychology is given by Simon (1973).

The CCD is perhaps one of the better designs for employing
the powerful methodological features proposed by Box and his
associates. Its chief limitations are that it requires five
levels of each variable to be selected at specific locations on
a continuous scale. Also, the investigator must be reasonably
confident that the surface he intends to approximate can be fit
by a first or second order model; the CCD was never intended to
fit a higher-order model, although this can be done with a
great deal of extra effort.²

²There are response surface designs available for fitting third
order models if the experimenter can anticipate their necessity
on the basis of some preliminary tests (Das and Narasimham, 1962).

RSM, when properly applied, provides the user with an
extremely economical, efficient, and flexible research plan. The
very characteristics that make it most effective can be the ones
to which it will be most difficult for the psychologist, nurtured
primarily on factorial designs and ANOVA models, to adapt.

The Box and Hunter (1958) paper clearly and succinctly
summarizes much of the original thinking on response surface
methodology and shows how it affects the development of
experimental designs. In that paper, the authors present and
support the following desirable characteristics which an
experimental design for fitting response surfaces should include
whenever possible:

1. Utilize a grid of data points of minimum density over a
   multivariate space of greatest practical interest.
2. Allow for approximating a polynomial of an order
   tentatively assumed to be representationally adequate to fit
   the response surface (BH143,2i); when no assumption is
   made of the form of the function initially, one starts
   with a first-order polynomial model (BH143,1).
3. Allow a check on the adequacy of the function by allowing
   certain combinations of higher order terms to be examined
   (BH143,2ii).
4. Permit the already completed design of order d to form
   the nucleus from which a design of order d + 1 may be
   built, if the assumed polynomial proves inadequate
   (BH143,2iii).
5. Permit blocking (BH143,2iv) which
   a. helps maintain a constant experimental environment
      when an experimental program is extended over many
      data points and time, and
   b. permits an experiment to be carried out sequentially,
      so that changes can be made in the experimental plan
      based on information obtained from the previous data
      collection period.
6. Be "rotatable" so that the orthogonal axes of the
   experimental design can take any orientation without changing
   the confidence in the prediction made at any given point
   (BH155,5) (BH167,2), while maintaining relatively uniform
   precision across more than half the surface extending
   from the center of the space (BH169,1).


In addition, this approach deemphasizes precision in favor of
greater accuracy in the model required to fit the empirical data
(BH152,1). The primary function of the approach is to estimate
a complete equation; only secondary concern is given to the nature
of the individual terms (BH165,2; BH175,2).

This class of design was originally proposed for the
"exploration and exploitation of response surfaces" and provides a
method for efficiently searching a space to find the point of
optimum response. However, it has been equally effective when
used to map the response surface within specific boundaries of a
multifactor space, a more useful application when trade-off
decisions regarding system parameters must be made. This shift
in emphasis, however, does not change the importance of the
fundamental characteristics of response surface methodology nor
minimize the assumptions and limitations associated with its use.

Used for the appropriate purposes and properly exercised,
response surface designs provide an economical way of obtaining
an estimate of the relationship among a large number of variables.
RSM designs should be preceded by an effort to identify the more
important variables to be included in the study and followed by
an effort to obtain precise estimates at particular locations
within the space, if desired. RSM provides a flexible approach
that enables the experimenter to design and to modify his
investigation after the data collection has begun; it does not do
his thinking for him.


MISAPPLICATIONS OF RSM PRINCIPLES


The more fundamental features of RSM, cited earlier, are
listed as procedures in Table 1. A comparison is made in the
table to show where the experiments in the Williges series fail
to follow the RSM procedures developed by Box and his associates.
There are always specific situations when there will be good
reasons for not following a particular procedure; however, in
general, each represents an element of a powerful research
methodology and should not be discarded casually. To ignore some
of these procedures may be relatively inconsequential when only a
few variables are being studied; however, this casualness can lead
to a marked degradation in the effectiveness of the methodology
when the number of variables increases beyond that which
characterizes traditional psychological experiments.

In the sections that follow, the shortcomings of the Williges
experiments are described and discussed in detail for each
procedure, listed in the order they appear in Table 1.


Sequential Data Collection Plans

Psychologists have traditionally planned replicated
factorial-type designs and collected the performance data necessary
to fill every cell before the analysis is made. In response surface
methodology, economy is achieved by collecting as little data as
possible until there are indications from an early examination of
the first-stage results that more observations are needed to
decrease bias and variable error. The primary emphasis is on
decreasing bias error. Box and Hunter express this most
fundamental characteristic of RSM as follows: "the greatest economy
in experimentation, as well as the greatest simplicity, will
normally be attained if we employ at each stage, a polynomial of
lowest order needed to make further progress possible. We should
thus begin by assuming that a first order approximation is
to be employed. This assumption would be abandoned and a second
order approximation adopted, only when the first order
approximating function had proved inadequate." (BH142,2).

TABLE 1. COMPARING THE PROCEDURES USED IN THE
WILLIGES PAPERS WITH BOXONIAN RSM-CCD

                                      Williges Papers*
Fundamental Procedures of RSM         WB    WN           MW    RSM-CCD

Collect data sequentially in blocks,  No    No           No    Yes
beginning with only enough for a
first order model when no function
is assumed

Isolate second order from higher      No    No           Yes   Yes
effects in the analysis when
possible

Collect more data when lack of fit    No    No           -     Yes
is significant (p<.05) for the
second order equation

Assign conditions to orthogonal       Yes   Initially,   No    Yes
blocks to reduce confounding with           but later
irrelevant sources of variance              destroyed

Include multiple center points for    Yes   Sometimes    No    Yes
removing block effects, achieving
uniform precision, and improving
estimates of second order effects

Emphasize overall equation rather     No    No           No    Yes
than analysis of individual
coefficients

*Only three of the four experimental papers are listed since the
fourth, WM, was actually an adjunct to the MW paper.


Some response surface designs, such as CCD, are planned to
take advantage of this iterative feature. If one does not know
in advance what the order of the model must be, then it is
prudent--economical and efficient--to collect only enough data to
estimate a first order polynomial, plus a little more to test the
adequacy of fit. If the lower order model is adequate to fit the
empirical data, then the experiment can be terminated and the
investigator is saved the effort of collecting data to estimate
higher order effects that are negligible.

Even when one suspects that a first order polynomial may not
fit the data, it still is more efficient to start by collecting
only enough data to fit and test a first order model and analyze
it before completing the design. This would be particularly true
when a large number of variables are being studied and a flexible
strategy is desirable. By examining his data before collecting
enough to fit a second order model, the investigator has the
option of examining the magnitude of the first order coefficients.
If he then discovers variables with negligible effects on the
response (by real world standards), these might be dropped from
the remainder of the study.

Furthermore, on the basis of this early analysis, he can
decide whether or not to expand, contract, or shift the
coordinates of his experimental space or to modify the measurement
scales of some variables (BH148,3; BH175,1). This flexibility can
enable an experimenter to arrive at a correct answer more quickly
and cheaply and without ever collecting data at the original
coordinates of the star portion of the CCD. Meyers (1963)
illustrates how this approach is used effectively in a
four-variable study of [...] data had been examined that was
quite different from that which had been planned originally.

None of the Williges studies employed this sequential and
iterative data collection approach so fundamental to the economy
and efficiency of RSM. Instead, all of the data required to
complete the second order polynomial were collected before the need
had been determined. Since subsequent analyses showed that in some
cases the first order model fit the data and that the effects of
some variables were negligible, this failure to use correct RSM
resulted in a great deal of data being collected unnecessarily. An
investigator who might wish to study a large number of variables
could suffer a considerable economic loss if he failed to realize
that the methodology in the Williges papers is not optimized.
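
The first stage of such a sequential plan can be sketched numerically. The example below is illustrative only: the data are invented, the design is a two-variable cube with four center points, and the single-degree-of-freedom curvature sum of squares is the standard textbook form rather than anything taken from the papers under review.

```python
from statistics import mean, variance

# Hypothetical first-stage data: the four corners of a 2x2 cube
# plus four replicated center points (coded units).
cube_x = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
cube_y = [10.0, 14.0, 11.0, 15.0]
center_y = [12.4, 12.6, 12.5, 12.5]

# First-order coefficients from the orthogonal cube contrasts.
b0 = mean(cube_y)
b = [sum(x[j] * y for x, y in zip(cube_x, cube_y)) / len(cube_y)
     for j in range(2)]

# Curvature check: the cube mean minus the center mean estimates the
# sum of the pure quadratic coefficients.
curvature = mean(cube_y) - mean(center_y)

# Pure error comes from the replicated center points.
pure_error = variance(center_y)

# Single-degree-of-freedom sum of squares for curvature and its F ratio.
n_f, n_c = len(cube_y), len(center_y)
ss_curvature = n_f * n_c * curvature ** 2 / (n_f + n_c)
f_curvature = ss_curvature / pure_error

print("b0 =", b0, "b =", b)
print("curvature =", curvature, "F =", f_curvature)
```

With these invented numbers the curvature estimate is essentially zero, so an investigator following the sequential strategy would have grounds to stop after the first block instead of running the star points.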


Questionable Data Analysis 


In two of the three Williges papers, after prematurely
collecting enough data to write a second order equation, they fail
to estimate the second order coefficients (WB316; WN329 and 332).
Instead, they obtain only the coefficients for a first order
polynomial and pool the estimates of the second order and higher
effects into a single term labeled "Lack of Fit." At a later
analysis, the second order terms were isolated. While not
employing sequential data collection, an RSM feature, they do
employ sequential data analysis, a questionable innovation in
this particular application.

The two, of course, are in no way equivalent methodologies.
While the former can result in a savings of time and effort, the
latter, i.e., performing a partial analysis of existing data, may
lead to faulty interpretations of the results. The procedure used
in these papers is analogous to collecting data to fill a factorial
design and then isolating only the main effects while pooling every
other source of variance. Pooling nonsignificant effects with
significant effects may mask the presence of significant effects.

How this procedure might detrimentally affect the
interpretation is illustrated by the fictitious data in Table 2.
Thus when all 120 degrees of freedom and associated interactions
are pooled (Line X, Table 2), the probability of finding reliable
interaction effects is only .20; however, when the more critical
two-factor interaction effects are isolated from all higher order
effects (Line Y, Table 2), they become statistically significant
at the .001 level. In this fictitious data, it would have been
prudent to continue to isolate more higher order interactions
until the proportion of the sum of squares remaining was small
when compared with that of the main effects.

TABLE 2. ANALYSIS OF FICTITIOUS DATA FOR A 2^7
FACTORIAL DESIGN, TWO SUBJECTS PER CELL

Source of Variance            Sum of Squares    D.F.    Mean Square

Seven main effects                                 7
(X) Pooled interactions                          120
    (Y) 21 Two-factor                             21
        interactions
        99 Pooled higher                          99
        order interactions
Residual                           1600          128       12.5
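
The masking effect of pooling can be seen from the F ratios alone. In the sketch below only the residual mean square (1600/128 = 12.5) and the degrees of freedom follow the residual line and labels of Table 2; the two interaction sums of squares are hypothetical values chosen purely for illustration.

```python
# Residual mean square as in the residual line of Table 2.
ms_residual = 1600.0 / 128

# Hypothetical sums of squares (NOT the values of Table 2).
ss_two_factor, df_two_factor = 840.0, 21
ss_higher, df_higher = 990.0, 99


def f_ratio(ss, df, ms_error):
    """Mean square of a source divided by the error mean square."""
    return (ss / df) / ms_error


# Line X: all 120 interaction degrees of freedom pooled together.
f_pooled = f_ratio(ss_two_factor + ss_higher,
                   df_two_factor + df_higher, ms_residual)
# Line Y: two-factor interactions isolated from the higher order ones.
f_two_factor = f_ratio(ss_two_factor, df_two_factor, ms_residual)
f_higher = f_ratio(ss_higher, df_higher, ms_residual)

print(f_pooled, f_two_factor, f_higher)
```

The pooled ratio sits close to 1 while the isolated two-factor ratio is several times larger: the 99 small higher order terms dilute the mean square of the 21 that matter.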

That the same kind of confounding found in this fictitious
case would be found in the Williges series can be deduced from
the results reported in some of the papers. In the Williges-North
paper, they report no significant lack of fit at the conventional
p = .01 level in any of the combinations that they initially
analyzed (WN327 & 329). However, when they isolated the second order
effects--the data having already been collected--they found
"significant second order effects" (WN331,1).

A similar but reversed situation occurred in the
Williges-Baron paper. In it, the combined second order and higher
order Lack of Fit test was not statistically significant (WB316).
When the second order coefficients were isolated, they still were
not statistically significant, but the remaining Lack of Fit
term--now composed only of aliased higher order effects--became
significant (WB318,2). The presence of significant higher order
effects when second order effects were not significant in a
three-factor study suggests that the third order interaction might
be spurious. An inspection of the raw data could help clarify the
interpretation.


Significant Lack of Fit 
In two of three Williges papers (WB318,2; MW343 & 344), some
second order analyses revealed a statistically reliable lack of fit.
This meant that the equation did not adequately represent the data
and that more data would have to be collected to identify the
crucial higher order terms. In none of these cases, however, did
the investigators continue the experiment. Failing to continue the
experiment in the presence of a significant lack of fit is neither
proper RSM nor good research since the investigation has been
stopped before a correct answer has been obtained.
There are times when an investigator might justifiably halt
data collection in the face of a significant lack of fit. If the
significant lack of fit test actually accounted for a negligible
proportion of the total variance in the experiment compared to
that of the variables of interest, and it had been judged
"significant" only because of a proliferation of degrees of freedom
in the denominator of the F-test, an investigator might decide to
absorb this error rather than go to the extra expense of collecting
additional data. This of course assumes that he has attempted to
identify the source of this higher order effect through an
examination of his raw data and particularly interaction effects
that can be calculated from the data in the cube portion of his
design.
On the other hand, even if the Lack of Fit term were not
statistically significant, if it accounts for a relatively large
proportion of the variance, then one should not assume the fit is
adequate. For example, in those Williges papers with enough
published data to make the calculations (MW344), the proportion of
total variance accounted for by the Lack of Fit term--judged not
significant--was three-and-one-half times greater than two of the
four significant experimental variables and one-and-one-third times
greater than a third one (MW344). Under these circumstances, it
would be unsound to ignore the lack of fit as long as the
investigators considered the other variables worthy of further
consideration.
This emphasis on the proportion of total variance in the
sample (i.e., eta squared) rather than on the significance test
for identifying critical variables is important for a number of
reasons. As one can observe throughout the Williges papers (and
this will be discussed later in more detail), sources of variance
become more or less "significant" depending on how much data the
investigator may have collected. If a basic, unreplicated CCD is
employed, there are relatively few degrees of freedom in the error
term; this makes the power of the significance test quite low. In
that situation, the investigator would be better off relying on the
relative magnitude of the coefficients (BH175,1) rather than on the
results of an F-test to decide whether or not there is an
indication of a lack of fit.³
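
The proportion-of-variance comparison recommended here is simple to compute from an analysis of variance table. The following is a minimal sketch with hypothetical sums of squares; the source names and values are invented for illustration and are not taken from any of the papers discussed.

```python
def eta_squared(ss_by_source):
    """Proportion of the total sum of squares accounted for by each source."""
    total = sum(ss_by_source.values())
    return {name: ss / total for name, ss in ss_by_source.items()}


# Hypothetical sums of squares for two variables, lack of fit, and error.
ss = {"x1": 40.0, "x2": 12.0, "lack_of_fit": 30.0, "error": 18.0}
props = eta_squared(ss)

# Rank the sources by the share of variance they account for.
ranked = sorted(props, key=props.get, reverse=True)
print(props, ranked)
```

In this invented case the lack-of-fit term accounts for more variance than the second variable, which is the kind of comparison the text argues should guide the decision to collect more data, whatever the F-test says.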

³When the terms of an equation are orthogonal, a beta coefficient
equals a Pearson product-moment correlation. These also equal eta,
which is the square root of the proportion of total variance
accounted for by the term.

This can be illustrated by the data from a paper by North and
Williges (1971) not in this series but which was a preliminary
version of the paper (WN) published in Human Factors, 1973. A
portion of their Table 15 is replicated in Table 3. With 20 and 3
degrees of freedom, an F of 4.301 can occur by chance approximately
13 times out of 100 samples taken from a single population. The
investigators, having used the .05 probability level as a standard
for rejecting the null hypothesis in other tests of significance,
refused to reject the null hypothesis when p approximated .13.
The investigators made no effort to look for higher order
effects even though the Lack of Fit term accounted for .493 of the
total variance at the same time the entire linear regression of
four terms accounted for only .488.⁴

TABLE 3. LACK OF FIT TEST TAKEN FROM WILLIGES
AND NORTH'S (1971) TABLE 15

Source of Variance      df      Variance      F

Lack of Fit             20        0.18      4.301
Replication              3        0.04

⁴This meant that while they had refused to reject the null
hypothesis when the probability of error was 15/100, they were
willing to accept the hypothesis that the equation adequately fit
the data when to do so with only 20 and 3 degrees of freedom
meant that the probability for error was 60/100. With the three
degrees of freedom, the test of significance was too insensitive
to be used as a criterion for the adequacy of fit; the proportion
of variance however was an excellent indication that the fit was
not adequate. Accepting the null hypothesis in that case could
result in a Type II statistical error as well as an error that
could have considerable practical significance. A failure to
obtain the proper equation could result in improperly designed
equipment or incorrect estimates of performance.


Orthogonal Blocking

Box and Hunter write: "In attempting to explore the response
of an unknown function of several independent variables, an
experimenter's strategy generates sequences of experiments that
fall naturally into separate blocks." (BH175,1) This concept was
inferred in the earlier discussion on "Iterative Data Collection
Plans." In experiments using a CCD, the cube and the star portions,
individually, are complete experiments capable of measuring all
first order effects; together, they are orthogonal blocks of a
second order CCD.

Orthogonal blocking refers to the grouping of data collection
points in an experimental design in such a way that differences in
mean performance among blocks will not affect the estimates of
effects within blocks. Orthogonal blocking rarely has been used
by psychologists in spite of the fact it is a powerful method for
minimizing the effects of unidentified sources of variance in
experimental data, of effectively conducting studies when the
availability of subjects or materials is restricted, and of
economizing when sequential data collection strategies are employed.
Simon (1970a; 1970b; 1973; 1974, pp. 100-103) describes blocking
techniques and illustrates ways they might be employed in various
types of human factors engineering research. Orthogonal blocking
is an integral and important technique for response surface
methodology and can be used to maximum advantage with CCDs
(BH174-178).

Since the first block of data for estimating first order
effects and the tests to see whether the resulting model adequately
"fits" the data form a complete experiment, the study might be
terminated if the data so warrant. However, if the experimenter
decides to continue to collect new data (after taking full
advantage of the results of the first experiment to decide what new
data should be taken), he is faced with the problem of handling
shifts in average performance from known or unknown causes that
may occur between the time the two experiments (or two parts of
the CCD) were run.

With the appropriate selection of certain parameters affecting
the CCD, however, the design can be orthogonally blocked and the
investigator can collect the data with confidence that any mean
performance differences between the two blocks will not affect the
linear and second order coefficients of the polynomial generated
from the combined data. Furthermore, unwanted effects confounded
only with blocks can be removed.

For example, if mean performance shifted between blocks as a
result of uncontrolled drift in the equipment or environment or if
different stimuli, subjects, or experimenters (WB313,3) were
assigned to the orthogonal blocks, then the average effects
associated with these sources of variance would be confounded with
the average effects of blocks. However, since orthogonal blocking
is used in a properly designed CCD, these unwanted effects not only
can be isolated from the error term, but will also have no effect
on the estimates of the coefficients of the second order polynomial.

This technique for cleansing experimental data can be extended
in CCDs since the cube (i.e., 2^k) portion can be blocked still
further. Any 2^k design of three or more variables can be divided
into blocks in such a way that the effects of blocks will be
orthogonal to all first and second order effects. For example, a
2^3 factorial design can be divided into two orthogonal blocks of
four points each; a 2^7 factorial design can be divided into 16
orthogonal blocks of eight points each. Thus trends and other
biasing effects of unidentified factors running through the data
can be eliminated or reduced by this process of dividing the design
plan into sub-sets or blocks.
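
The 2^3 case can be checked directly. The sketch below assumes the customary choice of the three-factor interaction ABC as the blocking contrast; the text does not name a contrast, so that choice is an assumption made for illustration.

```python
from itertools import product

# The eight runs of a 2^3 factorial in coded (-1, +1) units.
runs = list(product((-1, 1), repeat=3))

# Assign each run to a block by the sign of the three-factor
# interaction ABC (assumed blocking contrast).
block = [a * b * c for a, b, c in runs]


def dot(u, v):
    """Inner product of two contrast columns."""
    return sum(p * q for p, q in zip(u, v))


# First order (main effect) and second order (two-factor) columns.
columns = {
    "A": [a for a, b, c in runs],
    "B": [b for a, b, c in runs],
    "C": [c for a, b, c in runs],
    "AB": [a * b for a, b, c in runs],
    "AC": [a * c for a, b, c in runs],
    "BC": [b * c for a, b, c in runs],
}

# A zero inner product means the block contrast cannot bias the effect.
is_orthogonal = {name: dot(col, block) == 0 for name, col in columns.items()}
block_sizes = (block.count(-1), block.count(1))
print(is_orthogonal, block_sizes)
```

Each block holds four points, and the block contrast is orthogonal to every main effect and two-factor interaction, so a shift in mean performance between blocks leaves those estimates untouched.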


° 
As Myers (1971, p. 176) writes: "Blocking becomes an
essential part of the experimental procedure when all of the
experimental runs required by the design cannot be made under
homogeneous conditions." In behavioral research, the prudent
experimenter should ordinarily block automatically to keep the
estimates of interest as unconfounded as possible. The
Mills-Williges study, however, failed to incorporate orthogonal
blocking into the design. As a consequence, after the fact, there
is no way of knowing to what extent uncontrolled, unmeasured, and
unidentified sources of variance were distorting--in either
direction--the estimates of the regression coefficients, the lack
of fit, and the so-called error estimate (which absorbs much of
this variability).

When orthogonally blocked designs are available, it is neither
good RSM nor good experimental methodology in general not to use
this valuable technique. Orthogonality in this study was lost when
the investigators decided not to use multiple center points in the
basic CCD and failed to adjust accordingly the noncentral
coordinates for the star points, referred to as ±α, for each
dimension. They used an α of 2.000, suitable for a five-factor,
blocked design, when only a half-replicate of the cube portion is
used, instead of 2.345, the correct α when a single center point
is used.
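
The two α values quoted above can be reproduced from the condition for orthogonal blocking of a CCD into a cube block and a star block, as it is usually stated in the response surface literature. The sketch below is an illustration under assumptions: the function name is hypothetical, and the allocation of center points to blocks (six in the cube block and one in the star block for the classic design; none in the cube block and one in the star block for the single-center-point case) is assumed because it reproduces the quoted figures, not because the text states it.

```python
from math import sqrt


def alpha_for_orthogonal_blocks(n_cube, k, n_center_cube, n_center_star):
    """Star distance that makes the cube and star blocks orthogonal.

    n_cube         number of cube (factorial) points
    k              number of variables (the star block has 2*k axial points)
    n_center_cube  center points run with the cube block
    n_center_star  center points run with the star block
    """
    n_star = 2 * k
    return sqrt(n_cube * (n_star + n_center_star)
                / (2.0 * (n_cube + n_center_cube)))


# Five factors, half-replicate cube (16 points).
alpha_classic = alpha_for_orthogonal_blocks(16, 5, 6, 1)
alpha_single_center = alpha_for_orthogonal_blocks(16, 5, 0, 1)
print(round(alpha_classic, 3), round(alpha_single_center, 3))
```

Under these assumptions the classic allocation gives α = 2.000 and the single-center-point case gives α = 2.345, which is why keeping α at 2.000 after dropping the extra center points destroys the orthogonality of the blocks.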


Multiple Center Points 


Central-composite designs are made up of hypercubes, measure 


* 


Salieepes (stars), and one or more points at the center of the 


design. Box and his associates cite a number of advantages if
more than a single center point is used in the basic CCD. Clark
and Williges (CW306,3), however, propose to modify the classic
Boxonian CCD by eliminating multiple center points in the basic
design when all other points of the basic CCD have been replicated.
Instead, they retain only a single center point in the basic CCD
that would be replicated along with all of the other experimental
conditions. This plan was followed in the Mills-Williges study and
in some of the Williges-North analyses (WN328,3). Dropping multiple
center points from a totally replicated CCD was the only true
modification of the basic design that Clark and Williges proposed. The
result of this change is to degrade the effectiveness of RSM without
enough advantage to justify the change. The pros and cons of
replicating an entire basic CCD will be discussed later in this paper;
here the discussion is concerned only with the consequences when
the basic design (replicated or not) fails to include multiple
center points.

The number of center points in a CCD affects the following
design characteristics and functions:
1. The test for presence of quadratic effects in the first-
order model (BH152,3).
2. The estimate of "pure" error variance needed to test the
statistical significance of the lack of fit (BH169,2).
3. The orthogonality of blocked CCDs (BH176,4).
4. The "rotatability" of the CCDs (BH168,2).
5. The uniformity of the "information" profile (BH168,4).
6. The ability to isolate block and trend effects (Simon,
1974, p. 102).

Quadratic effects. If one or more center points are included
along with the hypercube portion of a CCD, the difference between
the mean of the center points and the mean of the 2^k points of the
hypercube provides estimates of the sum of the quadratic effects
and the variance to be used to test for a lack of fit of the linear
model. While this test might be made from data taken at a single
center point, data from multiple center points (by the same subject)
will provide a more stable estimate.
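
The comparison described above can be sketched in a few lines. This is an illustration only: it uses the textbook single-degree-of-freedom curvature check, which is an assumption about the form of the test rather than a formula given in this report, and the response values are hypothetical. The repeated center points supply the variance against which the curvature sum of squares is compared, which is why at least two of them are needed.

```python
from statistics import mean, variance

def curvature_test(cube_y, center_y):
    """Compare the center-point mean with the mean of the 2^k cube points.

    cube_y   -- responses at the hypercube points
    center_y -- repeated responses at the center point (at least two)
    Returns (difference, curvature sum of squares, F ratio on 1 and n_c - 1 df).
    """
    n_f, n_c = len(cube_y), len(center_y)
    # Center mean minus cube mean; its expectation is minus the sum of
    # the pure quadratic coefficients (coded units).
    difference = mean(center_y) - mean(cube_y)
    ss_curvature = n_f * n_c * difference ** 2 / (n_f + n_c)
    # Variance among repeated center points, n_c - 1 degrees of freedom.
    center_variance = variance(center_y)
    return difference, ss_curvature, ss_curvature / center_variance

# Hypothetical data: a 2^2 cube plus four center points.
cube_y = [10.0, 12.0, 14.0, 16.0]
center_y = [15.0, 16.0, 17.0, 16.0]
difference, ss_curvature, f_ratio = curvature_test(cube_y, center_y)
print(difference, ss_curvature, f_ratio)
```

With a single center point `variance` cannot be computed at all, which mirrors the point made in the text: the difference can still be formed, but there is nothing stable to test it against.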

Error estimate. Without overall replication, multiple center
points in the basic CCD provide the only estimate of experimental
error. This estimate should be made up of the "chance" variability
that occurs when the same point is measured several times under the
same conditions; it can be contaminated by variability associated
with effects that occur when data are collected sequentially.⁴ However,
when every point in the basic design is replicated, Clark and
Williges propose that only a single center point be used in the basic
CCD since another source for estimating experimental error would be
available (CW306,3). What they fail to indicate is that it would
not be an equivalent "experimental error", nor would it be as "pure"
an estimate of error.

When there are five variables as in the Mills-Williges
experiment, the basic CCD would be made up of 30 experimental
conditions of which four would have been repeated measures at the
center point. In that design, the Subject-by-Center Points variance,
with 3 x 3 = 9 degrees of freedom, could have been estimated and
used as a relatively "pure" estimate of error. However, with the
replication, Mills and Williges decided to eliminate three of the
four center points, leaving only 27 conditions in the basic CCD.
Therefore, instead of an error term involving the variance of the
repeated center points, they used for their error variance a term
labelled "Replication" (MW343;343), which was actually the Subjects-
by-Experimental Conditions interactions.

⁴Further contamination would occur if an experimenter tested a
different subject on each condition (including each repeated center
point) of the design. Considering how variable subjects often are,
this confounding of subject and conditions differences would
ordinarily not be warranted if only a single replication of the design
were used.

Subject-by-Conditions interactions may occur, not by chance,
but because such effects often actually exist. They may also occur
when truncated, "ceiling and floor" effects are present, when
there are uncontrolled and unisolated sequence effects (trial-to-
trial transfer as well as long term trend), and when uncontrolled
incidents occur during the data collection. Any argument regarding
the purity of this error estimate might have been stronger had
linear, quadratic, and cubic trend effects (a total of 9 degrees of
freedom) been isolated from the "Replication" term; or had it been
demonstrated that the Subject-by-Linear Terms and Subject-by-
Quadratic Terms interactions (a total of 25 and 45 degrees of freedom
respectively) were not significantly greater than the Subject-by-
Lack of Fit term, the most likely term to represent "error." In any
case, had the design with multiple center points been used, this
entire question of an appropriate error term would have been avoided.

Orthogonality. For orthogonality between estimates of the first
and second order coefficients, a certain relationship must exist
between the number of center points, the number of experimental
conditions in the first and second order blocks, and the value of α
(i.e., the distance from the center to the points of the star)
(BH176,4; CW301,2). It is possible to obtain this relationship
with only a single center point in the basic CCD, even if
the single center point were located in the star block. But
ordinarily, if the investigator intends to use orthogonal blocking
along with the iterative approach proposed for RSM, he would begin
with the points of the cube block, since the resulting data allow
an immediate test of the presence of cross-product, second-order
effects. Then with center points added to test for possible
quadratic effects, he is forced to have multiple center points in his
completed CCD since at least one other will be required in the
star block and additional ones in the cube portion if it is
sub-blocked. The use of the single center point might be acceptable
only if the investigator decided to take the less efficient
approach of starting his experiment with the star block first.
This entire consideration was avoided, however, in the Mills-
Williges study which used neither blocking nor the iterative approach.


Rotatability. A rotatable design is one in which the precision
of an estimate is the same at all points equidistant from the
center of the experimental space. Rotatability is a primary feature
in many of the response surface designs. With CCDs, rotatability
is obtained by selecting the proper value for the length of the
axis arms of the star, α (BH171,1; CW229,3). However, with
exceptions, the α values appropriate for orthogonality and for
rotatability, while reasonably close when multiple center points
are used, are not equal.

In general, it is agreed that when a decision must be made,
the quality of orthogonality is more important to preserve than
that of rotatability (BH177,3) (CW301,5). When only a single
center point is used for orthogonal blocking, however, the
discrepancy between the α for orthogonality and for rotatability increases;
if the one for orthogonality is chosen, the rotatable characteristic
is further distorted. While not necessarily a serious matter, it is
still another degradation that occurs when single center points are
included in the basic design.

Uniform information profile. With only a single center point
in the basic design, "information" at the center of the response
surface will be less precise than at points further from the center.
"Information" at any point on the response surface is the reciprocal
of the variance at that point (BH166,2). Since the center of the
experimental space will ordinarily be that portion in which there
is the greatest interest, Box proposes that additional (multiple)
center points be included in these response surface designs to
make the contour of the information profile approximately constant
over the central interval between the two levels of the cube
portion. Beyond these points, precision is allowed to degrade
considerably (BH169). In the Mills-Williges experiment, with only a
single center point in the CCD, the precision of performance
estimated at the center of their experimental space is poorer than that
estimated away from the center.
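
The effect of the number of center points on the information profile can be illustrated numerically. The sketch below is an assumption-laden illustration, not the Mills-Williges design: it uses a two-factor CCD with α = √2, a full second-order model, and the usual expression f'(X'X)^-1 f for the variance of a predicted response in units of the error variance. All function names are hypothetical.

```python
import math

def ccd_points(alpha, n_center):
    """Two-factor CCD: 4 cube points, 4 star points, n_center center points."""
    cube = [(a, b) for a in (-1.0, 1.0) for b in (-1.0, 1.0)]
    star = [(-alpha, 0.0), (alpha, 0.0), (0.0, -alpha), (0.0, alpha)]
    return cube + star + [(0.0, 0.0)] * n_center

def model_row(x1, x2):
    """Terms of the full second-order model in two factors."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices)."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def scaled_variance(points, x1, x2):
    """Variance of the predicted response at (x1, x2), in units of sigma^2."""
    rows = [model_row(*p) for p in points]
    m = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    inv = invert(xtx)
    f = model_row(x1, x2)
    return sum(f[i] * inv[i][j] * f[j] for i in range(m) for j in range(m))

alpha = math.sqrt(2.0)
single = ccd_points(alpha, 1)      # one center point
multiple = ccd_points(alpha, 5)    # five center points
var_single_center = scaled_variance(single, 0.0, 0.0)
var_single_r1 = scaled_variance(single, 1.0, 0.0)
var_multi_center = scaled_variance(multiple, 0.0, 0.0)
var_multi_r1 = scaled_variance(multiple, 1.0, 0.0)
# "Information" is the reciprocal of each variance.
print(var_single_center, var_single_r1, var_multi_center, var_multi_r1)
```

In this small example the single-center-point design has a larger prediction variance (less information) at the center than at a coded distance of 1 from it, whereas adding center points brings the variance at the center down sharply.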

Isolating block effects. As stated earlier, with only a single
center point, an orthogonally blocked design is possible. However,
with a single center point in the CCD, no estimate of block effects
is possible. This means that no trend or other effects that might
be confounded with blocks can be isolated, the consequence of which
is to distort the estimated error variance. This in turn will
affect tests of statistical significance. It would be optimistic
to assume that these effects are negligible in most human
[...] research. It is only prudent to use methods that will [...]


[...] orthogonal to one another (BH163,1). While Myers (1971, pp. 133-134)
describes a way to adjust the design so that this correlation would
be eliminated, the adjustment will affect other characteristics of
the design and is ordinarily not justified.⁵ While not a serious
matter when one does not evaluate each term of the equation, the
degree of correlation among quadratic terms is greater when only a
single rather than multiple center points are used (with other
parameters properly adjusted), as shown in Table 4. The consequence
of this correlation is to make the estimates of the coefficients of
the quadratic terms differ depending upon the particular order in
which each is isolated in the analysis.
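
How the correlation among quadratic coefficients shrinks as center points are added can be sketched with exact arithmetic. The example is illustrative only and does not reproduce the configurations behind Table 4: it assumes a two-factor CCD with α² = 2 held fixed, and it assumes the usual CCD symmetry in which the intercept and pure quadratic terms are uncorrelated with all other terms, so that only their block of the moment matrix needs to be inverted. Function names are hypothetical.

```python
from fractions import Fraction

def quadratic_block(k, n_cube, alpha_sq, n_center):
    """X'X restricted to the intercept and the k pure quadratic terms."""
    n_total = n_cube + 2 * k + n_center
    sum_sq = n_cube + 2 * alpha_sq              # sum of x_i^2 over the design
    sum_fourth = n_cube + 2 * alpha_sq ** 2     # sum of x_i^4
    sum_cross = n_cube                          # sum of x_i^2 * x_j^2, i != j
    size = k + 1
    m = [[Fraction(0)] * size for _ in range(size)]
    m[0][0] = Fraction(n_total)
    for i in range(1, size):
        m[0][i] = m[i][0] = Fraction(sum_sq)
        for j in range(1, size):
            m[i][j] = Fraction(sum_fourth if i == j else sum_cross)
    return m

def invert(matrix):
    """Gauss-Jordan inverse in exact rational arithmetic."""
    n = len(matrix)
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def quadratic_correlation(k, n_cube, alpha_sq, n_center):
    """Correlation between two pure quadratic coefficient estimates (k >= 2)."""
    inv = invert(quadratic_block(k, n_cube, alpha_sq, n_center))
    # By symmetry all quadratic estimates share one variance, so the
    # correlation is the covariance divided by that common variance.
    return inv[1][2] / inv[1][1]

corr_single = quadratic_correlation(2, 4, 2, 1)    # one center point
corr_multiple = quadratic_correlation(2, 4, 2, 5)  # five center points
print(corr_single, corr_multiple)
```

The sign and size of the correlation depend on the number of factors, the value of α, and the blocking arrangement, so these numbers should not be compared with the entries of Table 4; the sketch shows only the direction of the effect of adding center points.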


⁵Since this report was prepared, Williges published another paper,
"Research Note: Modified Orthogonal Central-Composite Designs", in
Human Factors, 1976, 18, 95-97. In this paper he cites Myers'
calculations for the alphas required for a completely
orthogonal design. However, he failed to note Myers' comment
regarding this design, namely: "As we implied previously in this
section, there are important choices of α to consider, other than
the value which makes the design orthogonal. In many cases, these
other choices are more desirable than the orthogonal CCD."

TABLE 4. CORRELATIONS AMONG QUADRATIC TERMS AS A
FUNCTION OF NUMBER OF VARIABLES AND
NUMBER OF CENTER POINTS IN CCD

Number of Variables      Number of Center Points
      in CCD             Single        Multiple

Six*                     [...]         [...]
Seven*                   -.164         -.044 (14)

[Remaining rows and the table notes (number of center points;
blocked design) are illegible in the source.]

Deemphasizing Individual Coefficients


It is generally good practice to draw as much information from
the experimental results as possible. However, the type of
equation generated by response surface designs, such as the CCD, was
never intended to be examined term by term. Box and Hunter write:

"Now the primary object of the experimental designs described
in this paper is to estimate an unknown response function by means
of a mathematical model obtained by using a Taylor's Series
expansion of some order. Using such an experimental design,
observations are recorded at N points in the factor space, and this
evidence is used to estimate the coefficients of the model by
least squares. The interest therefore is really directed at the
complete estimation equation and not an investigation of the
individual estimated coefficients and their variances." (BH165,2).

In CCDs and other response surface designs, the precision of
the various estimated coefficients is not constant and, as has
already been noted, some coefficients may be correlated. The
effects of this are discussed quite thoroughly by Box and Hunter
(BH163-167) and are of little concern if the important consideration
is the fit of the overall equation. Significance tests are to be
applied to pooled estimates of the different orders of the model,
i.e., first, second, higher (lack of fit), rather than to each
individual term. In this regard, Box and Hunter write:

"It should be noted here that the individual coefficients of
the model have not been separately tested for significant departure
from zero. If this had been done, and one coefficient was found not
to be significantly different from zero, we would not be entitled to
replace the given estimate with a zero, for regardless of its
magnitude, it is still the best estimate of the unknown coefficient.
To replace this estimate by a zero would in effect be replacing a
best estimate by a biased one. The important test concerns the
order of the model; i.e., whether a model of first order, or of
second order, adequately represents the unknown function."
(BH174,2).

All of the Williges studies continue to embrace the F-test
orientation by examining the statistical significance of each term
of the polynomial. Because in two of the studies (WB and WN) only
a partial analysis of the collected data was carried out (a fault
that was discussed earlier in this paper), the examination of only
the linear terms might provide an erroneous interpretation of the
reliability of the individual variables. The proper test of the
variables, rather than the terms, should have included the
unanalyzed second order components. Box and Hunter write the following
concerning this procedure:

"Another test that could be run would be to determine whether
a particular variable Xi contributed significantly to the response.
In this case the sums of squares of all the coefficients bearing
an i subscript would be pooled and then tested. However, the
search for the important or significant variables should properly
preceed [sic] the estimate of a response function by a second
order model." (BH174,2).


Contrary to the examples provided in the Williges papers, the
last sentence in the above quote emphasizes the strategy whereby
the search for important or significant variables should properly
precede the collection of data for fitting a (second order) function.
In practice, since the number of candidate variables that
conceivably might have a critical effect on performance can be quite
large--15 to [...] in most human performance tasks--considerable
screening should have taken place prior to the effort to estimate
a response surface.

The task of identifying critical variables and the task of
relating them functionally should properly be done in two distinct
steps; this is the only economical and efficient method of handling
truly large numbers of variables (Simon, 1973). The first-order
phase of a CCD can be used for the identification purpose, as Box
and Hunter suggest, but for truly multifactor research, a more
intensive, preliminary screening effort might more practically be
carried out.

In the Williges papers, while examining individual terms,
the authors fail to warn the reader of the correlation among the
quadratic terms. Since the effects of these terms depend on the
order in which they are isolated in the regression analysis, the
reader should at least realize that any test of significance will
be affected to some degree, however small. In the Williges-North
paper, when analyzing the uncollapsed design, the authors throw
out the data collected on each subject for three trials at the
center (WN328,3). Since the remaining data had been collected with
α values appropriate for the complete design, the analysis is no
longer being made on a properly blocked design and estimates of
some first and second order coefficients will be correlated.
Interpretation of individual terms under these circumstances
is tenuous even if the investigator is aware of what he has done.

Furthermore, the papers do not make it clear that although a
single error term may be adequate to test the reliability of the
entire equation, its use to test individual coefficients that
differ in precision makes the interpretation of such an analysis
ambiguous. While an examination of results in depth is always
desirable, the investigator should be aware of what he is doing and
its weaknesses. These are not brought out in the examples in the
Williges series.

Finally, Williges and North suggest that one might keep
certain "marginally reliable" coefficients if one were searching
the experimental space but not if one wanted the more
valid and stable overall prediction equation (WN333,3). As Box
and Hunter note, the equation would be biased if marginally
significant terms were dropped. It is difficult to understand why a
biased equation is more acceptable for purposes of prediction than
for search, as Williges and North suggest. In the Williges-North
paper, the idea of dropping nonsignificant terms is promoted on the
grounds of parsimony. But which terms are significant changes in
these papers each time more replications are added and would
continue to do so until every term would eventually become significant
(Bakan, 1966, p. 426; Hays, 1963, p. 326; Kleiter, 1969, p. 10), so
that it is difficult to know at what point in the program one should
decide to drop a term. Under ordinary circumstances, Box and
Hunter's approach of keeping the terms once they have been
isolated seems the more manageable and accurate approach for
response surface studies.⁶


⁶After this report had been prepared, a paper by David J. Cochran
and LaVerne L. Hoag, "Response Surface Methodology and
Optimization -- A Possible Pitfall," was discovered in the Proceedings of
the Human Factors Society 19th Annual Meeting, October 1975. In
following the recommendations in the Williges series, these
investigators became aware of what they refer to as a "dilemma
for which the experimenter is given no method of resolving,"
namely, the problems of interpretation that arise when
"statistically non-significant" terms are dropped from the regression model.
Hopefully the discussion in this paper will help them resolve
their "dilemma", which was created not by RSM but by following
unwise procedures and by the ambiguities inherent in the
significance test.




NON-RSM METHODOLOGICAL CONSIDERATIONS: 


The second major criticism made of this series of papers is
that the authors employed poor methodologies not specific to RSM.
In some cases this was more or less the result of careless
planning; in other cases, however, it occurred as a result of calculated
decisions. These cases will be described in detail below.

After summarizing the features of CCDs as developed by Box
and his associates, Clark and Williges introduced what they refer
to as "modifications" of the basic, blocked, central-composite
design (CW295). The modifications are presented as a series of
alternatives, the relative advantages of which are determined
empirically by the four experiments in the series. Thus they
consider the relative advantages of:

     CCDs with multiple observations at only the center point
     versus CCDs with multiple observations at each experimental
     point.

Regarding designs of the latter type, they compare the relative
value of:

     Analyzing all of the collected data without modification
     versus collapsing across subjects at each data point prior
     to analysis

and also the relative values of using:

     Between-subjects designs in which no subject is observed
     more than once and observations at each experimental
     point might be multiple and unequal or multiple
     and equal; versus within-subject designs in which each
     subject is observed only once at each experimental point.

Contrary to what the authors imply, these variations per se
do not modify the basic CCD and can be discussed and considered
more or less independently of RSM. They are instead alternative
procedures that might be used with any basic experimental data
collection plan, be it CCD, factorial, and so
forth. In each of these, the methodological considerations are
essentially the same. Furthermore, although attempted in this
Williges series, the consequences of the alternatives cannot
properly be determined empirically, but only through a rational
determination based on a knowledge of their statistical and
mathematical characteristics. Let us examine each of these
alternatives in turn.

Center Point Versus Total Design Replication 

Clark and Williges proposed that rather than replicate only
at the center of a CCD, every point of the basic design be
replicated (CW304,1). Based on an experiment by Williges and Baron,
they conclude that total-design replication is better. It will be
shown, however, that their implementation of total design
replication was neither in accordance with good RSM nor the most economical
method of meeting the desired objectives, and that the empirical
study actually offered little support for their conclusions
regarding this issue.

An investigator may decide to replicate a basic experimental
design for either or both of two reasons: to measure performance
more precisely and/or to obtain an estimate of experimental error.
The former will lead to improved estimates of the coefficients in
a regression equation and ultimately the estimates of responses
derived from the equation. The latter may be used to establish
confidence limits and to perform tests of statistical significance.
Psychologists in general have tended to overuse and misuse
replication (Simon, 1973, pp. 19-31), often trading precious time and
money replicating rather than studying an expanded experimental
space. Many times the replication has been unnecessary and often
there are more economical, alternative methods available to meet
the desired goals. These criticisms become increasingly pertinent
as the number of factors in the experiment increases.

Although the "goodness" of an ANOVA design is partially
determined by how well it reduces variable error, discussions of
response surface designs have tended to play down concern with
variable error. This has been so for two reasons. One reason,
as discussed earlier, is that response surface designs also
emphasize the reduction of bias error (through improving the fit
of the model to the response) on the grounds that a design that is
sensitive to bias errors is actually sensitive to both bias and
variable error. In this regard, Myers (1971, p. 201) writes:
"In fact, it would seem that errors that occur due to bias play
an even more important role, as far as [the estimated response y]
is concerned, than those errors which result from sampling
variation." Earlier he had noted that only when the variable
contribution is more than six times the bias would an experimental
design, totally concerned with bias error, not be adequate.

Discussion of variable error has also been minimized in many
papers on RSM because these techniques were being applied in
chemical rather than agricultural or human performance studies.
In the former, responses tend to be more reliable than in the
latter types, making variable error less of a problem. However,
Box and Hunter do not totally ignore the issue, for they write in
accordance with good RSM principles: "In some examples the large
size of the experimental error would make it essential to replicate
the experiments. If the size of the experimental error is not
known it is best to proceed sequentially, performing further
experiments if the standard errors of the coefficients estimated from
the first set are too large" (BH144, footnote).

Thus, unlike the Williges studies in which the decision to
make multiple replications of the basic design preceded any data
collection, in RSM each replication is considered a
new experiment to be added only after examination of the previously
collected data suggests that it is warranted. As we shall see,
even when the need for some replication can be anticipated, the
massive replication approach proposed by Clark and Williges and
used in the illustrative studies is not the most economical.

But in the above quote, Box and Hunter were concerned only
that the general magnitude of the experimental error of the
observed responses might be large and should be reduced with
replication throughout the design. Clark and Williges properly
point out the possibility that the experimental error of the
observed responses might vary in different parts of the
experimental design. They write: "When the goal is to approximate
an entire response surface (rather than merely that portion of the
surface surrounding the optimum), limiting multiple observations
to a single experimental point may not be the most judicious
strategy. Indeed, the actual variability in response may be so
great across subjects and data points that it would be unrealistic
to presume the standard of estimate at the center point is an
adequate estimate of error at all points" (CW304,1).

However, except for repeating essentially the same comment
later in their paper (CW305,3), Clark and Williges never again
consider the problem of heterogeneity of variance of the observed
responses.⁷ Nor is there any discussion of the issue or of how it
was handled in any of the experimental studies in that series
which used total design replication to offset this potential
effect. If in these studies the variance of the observed responses
did in fact differ at different parts of the experimental design
(as Clark and Williges suggest might happen), then it was no more
proper to use the error estimate from these composite but
heterogeneous variances than it would have been to use the estimate
based only on the replicated center points. Neither estimate
would have been representative or suitable for performing a test
of significance.

⁷Clark and Williges do not make it clear to their readers that, in
CCDs, even if the variance at every observation point of the
experimental design were essentially equal, neither the variances of
the beta coefficients in the regression equation nor the variance
of the estimated responses throughout the response surface would
be equal. Box and Hunter were not concerned with the relative
precision of the estimated beta coefficients of the second order
model -- they are not equally precise -- for they consider this to
be "the wrong question" (BH163,2). Nor are unequal variances at
different points across the response surface an issue since
rotatable designs only require that points equidistant from the center
have equal variance. Variability increases considerably in
correctly designed CCDs beyond the ±1 (coded) points in the design
(BH169). Of course, even in classical factorial designs the
precision varies markedly across the response surface (BH166,2).


When Box and Hunter proposed using the replicated center points
for estimating error variance to test the lack of fit they did so
with "the usual assumption that the variances of all determinations
are equal" (BH169,2). When this assumption is not met, it is not
correct to combine the heterogeneous variances. One advantage of
the iterative approach of RSM is that this heterogeneity would be
discovered early enough to permit some scale transformations to be
introduced to correct the matter before an expensive, massive
replication had taken place.

Now on the other hand, if in the Williges studies the
observed variances were found to be homogeneous after all, the
failure to use the iterative RSM approach to replication (as well
as to model building) could cost a great deal in wasted effort.
Even a few preliminary tests at selected points in the design
might have been a more economical way to determine the need to be
concerned with both the magnitude and the heterogeneity in
performance variability.

Clark and Williges write that the Williges-Baron study "affords
a striking demonstration of the effect of estimating experimental
error at a single replicated point as opposed to estimating it
across a series of replicated points" (CW304,1). Actually, the
study did not consider the original issue of variance heterogeneity
at different points in the experimental design. Instead, what this
empirical effort "demonstrated" was that "when replications were
restricted to the center points, none of the experimental factors
was found to contribute significantly to the response level,
despite their apparent importance in the resulting prediction
equation. When multiple observations were made at each of the data
points, however, the subsequent analysis revealed that some of the
experimental variables were significant in determining the response
level."

These statements are true in fact but false in implication.
All this study demonstrated was the obvious fact that when the
degrees of freedom for the error term are increased, the
significance test becomes more sensitive. Making the point that many
have made, Hays (1963, p. 326) states: "Virtually any study can
be made to show significant results if one uses enough subjects,
regardless of how nonsensical the content may be.... This kind of
testmanship...clutters up the literature with findings that are
often not worth pursuing, and which serve only to obscure the
really important predictive relations that occasionally appear."

Nunnally (1960, p. 643) reiterates the same point by saying:
"If the null hypothesis is not rejected, it is usually because the
N is too small. If enough data are gathered, the hypothesis will
generally be rejected." Certainly no empirical effort is required
to illustrate this fact, and, more to the point at hand, it does
not decisively demonstrate the relative merits of the two
procedures since the particular effect that Clark and Williges use as
proof for their conclusion also could have been achieved by
replicating the center point (in this example) twenty more times.
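
The point made by Hays and Nunnally can be shown with simple arithmetic. The sketch below is a deliberate simplification and not the analysis used in any of the papers discussed: it assumes a one-sample z test with known unit variance, a two-sided 5 percent criterion of 1.96, and a fixed standardized effect. The function names are hypothetical.

```python
import math

def z_statistic(effect_size, n):
    """z for a fixed standardized mean difference observed on n cases."""
    return effect_size * math.sqrt(n)

def smallest_significant_n(effect_size, z_crit=1.96):
    """Smallest n at which the fixed effect reaches the z criterion."""
    n = 1
    while z_statistic(effect_size, n) < z_crit:
        n += 1
    return n

# The same effect, large or trivially small, becomes "significant"
# once enough observations are collected.
for effect in (0.5, 0.05):
    print(effect, smallest_significant_n(effect))
```

Nothing about the size or importance of the effect changes between the two cases; only the number of observations feeding the error term does.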

When the variability of the observed responses is suspected
of being larger than desirable and the possibility of variance
heterogeneity throughout the design is a concern, Dykstra (1960)
proposes partially duplicating response surface designs.
Combined with the iterative approach of RSM, these plans provide
essentially the same information that the Clark-Williges massive
replication plan offered and do so far more economically. When
truly multifactor experiments are conducted, this saving can become
considerable.


Analyzing Collapsed Versus Uncollapsed Data


The major purpose of the Williges-North paper, they say, is
"methodological" (WN323,3). Clark and Williges (1973) discussed
two ways of analyzing data collected from a completely replicated
RSM central-composite design. All of the data could be
analyzed directly, or alternatively, the data could be collapsed
across subjects prior to analysis, thereby reducing the design to
the equivalent form of an unreplicated, basic RSM central-composite
design with repeated observations only at the center. These
alternate analyses were compared in the Williges-North study in
terms of their resulting sensitivity and in terms of the predictive
validity of the regression equation as determined through cross-
validation.

Two conclusions cited by Williges and North were that the
uncollapsed designs produced a more sensitive test than collapsed
designs and that uncollapsed designs gave more realistic
predictions than collapsed designs (WN334,3). In the discussion that
follows it will be shown that the first conclusion is inherent in
the F-test and needs no empirical verification and that the second
conclusion is not supported by the data.


Design sensitivity arguments. As the investigators themselves
noted (WN329,3), the analysis with the uncollapsed data was more
sensitive--which meant that more terms were found to be
statistically significant--than the analysis with the collapsed data
because of the drop in degrees of freedom--from 120 to 3--in the
error term after collapsing. Just why the investigators felt the
need to conduct an empirical study to demonstrate this fact is
unclear. In the preceding Williges-Baron study they had discovered (?)
that total replication increased the degrees of freedom in the error
term, thus causing a more sensitive F-test.

Now in this Williges-North study, they reverse the procedure
--since averaging across subjects is essentially equivalent to
removing replication--and lose degrees of freedom in the error
term and consequently sensitivity in the F-test. Later, when the
results from the cross-validation studies are combined with the
original data, i.e., essentially adding still more replications to
the uncollapsed data, the F-test becomes even more sensitive
(WN331,5). Since the value required for a significant F decreases
as the number of degrees of freedom in the error term increases,
these results could have been predicted without any empirical
study. Insofar as that conclusion is concerned, the experiment
was irrelevant.
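
The dependence of the required F on the error degrees of freedom
can be checked without collecting any data. The following is a
minimal Python sketch, not part of the original analysis. It assumes
a test of a single term (one numerator degree of freedom, so that the
critical F is the square of the two-sided critical t) and a .05
level; the function names are illustrative. The error degrees of
freedom 3 and 120 are those cited above for the collapsed and
uncollapsed analyses.

```python
import math

def t_pdf(t, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))

def t_central_area(t, df, steps=1000):
    """P(-t < T < t), by Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2.0 * total * h / 3.0

def critical_f(error_df, alpha=0.05):
    """Critical F(1, error_df), found as the squared critical t."""
    lo, hi = 0.0, 200.0
    for _ in range(50):  # bisection on the central area of t
        mid = (lo + hi) / 2
        if t_central_area(mid, error_df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return ((lo + hi) / 2) ** 2

for error_df in (3, 10, 30, 120):
    print(f"error df = {error_df:3d}   F required at .05 = "
          f"{critical_f(error_df):.2f}")
```

Under these assumptions the required F falls from roughly 10.1 with
3 error degrees of freedom to roughly 3.9 with 120, which is all that
the comparison of collapsed and uncollapsed analyses could show on
this point.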
Of a more serious concern, however, is the interpretation
implied by the investigators in both studies (WB and WN), namely
that the design that obtains the most statistically significant
terms is necessarily the better one. But it is not a suitable
criterion; in fact, as Lykken (1968, p. 158) says: "...statistical
significance is perhaps the least important attribute of a good
experiment; it is never a sufficient condition for concluding
that... a useful empirical fact has been established..." Hays
(1963, p. 300) states: "It is a grave error to evaluate the
'goodness' of an experiment only in terms of the significance
levels of its results... it is entirely possible for a highly
significant result to contribute nothing to our ability to predict
behavior, and for a nonsignificant result to mask an important
gain in predictive ability."

Dunnette (1966, p. 345) comments how most psychologists
"still remain content to build our theoretical castles on the
quicksand of merely rejecting the null hypothesis" and Nunnally
(1960, p. 650) warns: "We should not feel proud when we see the
psychologist smile and say 'the correlation is significant beyond
the .01 level'. Perhaps that is the most he can say, but he has
no reason to smile." Campbell and Stanley (1963, p. 22) sum it up
quite simply by saying: "Good experimental design is separable
from the use of statistical tests of significance."

In the context of CCDs, the primary purpose of the
significance test is to discover the adequacy of fit of the equation, and
even for this purpose, as stated earlier, it is best used merely
as an adjunct clue after examining the relative proportions of the
performance variance accounted for by the regression and by what's
left over.
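
Why the proportion of variance deserves attention before the F
ratio can be seen from the arithmetic that links the two. The sketch
below is illustrative only: the proportion (.10), the number of
regression terms (5, as in a second-order model in two factors), and
the numbers of observations are hypothetical values chosen for the
example, and f_for_r2 is an illustrative name.

```python
def f_for_r2(r2, n_terms, n_obs):
    """Overall regression F implied by a proportion of variance r2.

    F = (r2 / n_terms) / ((1 - r2) / error_df),
    with error_df = n_obs - n_terms - 1.
    """
    error_df = n_obs - n_terms - 1
    f_ratio = (r2 / n_terms) / ((1.0 - r2) / error_df)
    return f_ratio, error_df

# Same (hypothetical) equation quality, different amounts of data.
for n_obs in (15, 60, 600):
    f_ratio, error_df = f_for_r2(r2=0.10, n_terms=5, n_obs=n_obs)
    print(f"N = {n_obs:3d}  error df = {error_df:3d}  F = {f_ratio:5.2f}")
```

An equation that leaves ninety percent of the performance variance
unexplained yields an F of about 0.2, 1.2, or 13.2 depending only on
the amount of data behind it. The F ratio alone therefore says little
about how well the equation fits.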


Cross-validation arguments. Williges and North conducted
cross-validation studies and concluded that "...uncollapsed or
within-subject analyses as suggested by Clark and Williges (1973)
appear to provide a more sensitive analysis as well as more
realistic estimates of the predictive worth of the regression
equations as compared to collapsed analyses when predictions of
individual performance are made" (WN334,3). A more straightforward
conclusion appears in their summary, namely, "the uncollapsed,
within-subject designs provided the better prediction equations"
(WN321, summary) as compared to collapsed designs. The point of
discussion here will not be whether one form of the equation or
the other is in fact better but whether the investigators properly
interpreted their data and whether they employed the methodology
that would permit this type of conclusion to be drawn at all.

The basis for the conclusion drawn by Williges and North was
not, as is usually the case, how well the equations, derived from
one data sample, predicted performance obtained from a second data
sample. Rather, it was how well the correlations between
predicted and observed performance from two samples agreed with
estimated population correlations derived by applying a "shrinkage"
formula to the data from the first sample. The greater the
difference between the empirical and theoretical correlations,
with the latter being used as the standard of goodness, the poorer
Williges and North concluded their empirical results to be.
Essentially what these investigators seem to be saying is that
equations that ought to do better (but didn't) are better than
equations that did do better (but oughtn't to have according to a
formula of questionable merit). There are several formulae for
estimating shrinkage, each with its own assumptions and limitations.

Although they had originally used empirical results to
evaluate and select their shrinkage formula (North and Williges,
1972, p. 221), they now use the theoretical estimations to evaluate
the empirical. Shrinkage formulae are intended to be used in lieu
of further empirical tests. Although many questions can be raised
concerning the usefulness of cross-validation studies (Smith, 1970),
nevertheless if a competent data collection program has been
undertaken, then the fact rather than the theory should be the criterion
by which the equations (and designs) are to be evaluated. The
question is simply: "Which analysis--of uncollapsed versus collapsed
data--produced the equations that predicted the actual performance
from a second set of data more accurately?"
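
That shrinkage formulae disagree among themselves is easy to
illustrate. The sketch below compares two textbook estimates,
Wherry's adjustment and the Stein (Darlington) cross-validity
estimate, as they are commonly stated. Which formula Williges and
North actually applied is not given here, so neither should be read
as theirs; the sample values (R-squared of .60, 30 observations, 5
predictors) are hypothetical.

```python
def wherry(r2, n, k):
    """Wherry's adjusted R-squared for n observations, k predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def stein(r2, n, k):
    """Stein (Darlington) estimate of the cross-validated R-squared."""
    factor = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
    return 1.0 - (1.0 - r2) * factor

r2, n, k = 0.60, 30, 5  # hypothetical first-sample results
print(f"sample R^2      = {r2:.3f}")
print(f"Wherry estimate = {wherry(r2, n, k):.3f}")
print(f"Stein estimate  = {stein(r2, n, k):.3f}")
```

For these values the two estimates are about .52 and .39. A standard
of goodness that moves this much with the choice of formula is a weak
basis for overruling what the second sample actually showed.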

In Table 5, representative data extracted from Tables 5 and 6
in the Williges-North paper (WN333 and 334) are presented for one
condition, "Latency response with the black-and-white TV system."
There seems to be little question that equations derived from the
collapsed data in general always estimated observed performance as
well or better than equations derived from uncollapsed data. This
is true both within and between samples. This should not come as
any surprise since in the collapsed data, one major source of
uncontrolled variability--subject differences--has been removed.
Yet Williges and North concluded otherwise.
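
The size of the advantage that collapsing confers can be sketched
under a deliberately simple model, which is an assumption of this
example and not something taken from the Williges-North data: each
observed score is a condition effect plus independent subject-related
variability, and the equation reproduces the condition effects
exactly. The variances below are hypothetical; six is the number of
subjects reported for the study.

```python
import math

def expected_correlation(effect_var, subject_var, n_averaged):
    """Correlation between condition effects and observed scores.

    Averaging n_averaged independent subjects divides the
    subject-related variance by n_averaged.
    """
    noise_var = subject_var / n_averaged
    return math.sqrt(effect_var / (effect_var + noise_var))

uncollapsed = expected_correlation(1.0, 4.0, n_averaged=1)
collapsed = expected_correlation(1.0, 4.0, n_averaged=6)
print(f"individual scores (uncollapsed): r = {uncollapsed:.3f}")
print(f"scores averaged over 6 subjects: r = {collapsed:.3f}")
```

With these values the same equation correlates about .45 with
individual scores and about .77 with scores averaged over six
subjects. The difference reflects the data being predicted, not the
merit of the equation.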

But whether or not the cross-validation data had been properly
interpreted was actually a moot point. The data collection
methodology in this study was so confused that any conclusions regarding
the relative merits of collapsed versus uncollapsed data would be
questionable because of the other conditions irreconcilably
confounded with these two alternatives. The following are the more
obvious examples:


TABLE 5. CORRELATIONS BETWEEN OBSERVED PERFORMANCE DATA
AND VALUES ESTIMATED FROM EQUATIONS BASED ON
DIFFERENT REGRESSION MODELS*

Regression Model of Estimation Equation Derived From
First Data Sample:

    Collapsed, 2nd order model
    Collapsed, 1st order model
    Uncollapsed, 2nd order model
    Uncollapsed, 1st order model

Source of Observed Performance Data:

    Results from First Sample
    Results from Second Sample: Collapsed
    Results from Second Sample: Uncollapsed

    .438
    .433
    .425
    .450

*This data was taken from Tables 5 and 6 of the Williges-North (1973)
paper. It is only the data for the Latency response scores for the
black and white TV system, yet it is quite representative of all the
data.

1. The collapsed equations are based on median performance
   measures while the uncollapsed equations are based on
   mean performance measures. If the subject data is skewed
   and/or skewed differently for different experimental
   conditions, then the equations could predict differently
   without regard for the collapsing issue per se.

2. The designated "error" variance for the collapsed equation
   was actually an average within-subjects variability of
   measures all taken at the center of the experimental
   space. For the uncollapsed equations, the designated
   "error" variance was actually the interaction between
   subjects and the entire set of experimental conditions.
   Using different definitions of "unexplained" variance
   affects the proportion of total variance accounted for by
   the equations and differentially affects the tests of
   statistical significance based on this error variance.

3. All data initially collected were included in the
   derivation of the collapsed equation, while sixteen percent
   of the data were excluded from the derivation of the
   uncollapsed equation. The excluded data had come from the
   center points the investigators judged were superfluous.

4. In the collapsed equation, the coefficients of all first
   and second order terms were independent of block effects
   and the effects of one another. In the uncollapsed
   equations, first and second order terms were biased to
   some degree since dropping the center points destroyed
   the orthogonality of the CCD being used.



5. The investigators, concerned with possible "sequence"
   effects, stated that they counterbalanced the order in
   which the blocks were administered although any mean
   differences among blocks would have been neutralized
   anyway with the orthogonally blocked CCD. On the other
   hand, they did not indicate the method used to control
   unwanted sequence effects which are likely to occur when
   a subject is tested serially on the ten conditions within
   blocks. Since complete counterbalancing of the serial
   order of ten conditions with only six subjects, as used
   by Williges and North, is not possible, any sequence
   effects that may have occurred would differentially
   affect the two equations. One source of sequence effects
   is confounded with the subject-by-conditions interaction
   and, if not properly isolated, would distort the main
   effects of both sets of data and inflate the error term
   in the uncollapsed data.

Since none of the above is an inherent characteristic of
collapsing or not collapsing data, the confounding of conditions prevents
clear-cut assessment of the relative merits of these two methods
of analysis from the data presented.
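
The loss of orthogonality that follows from dropping center
points can be shown with a small design. The sketch below uses a
hypothetical two-factor CCD, not the Williges-North design: four
factorial points, four axial points at a distance of the square root
of 2, and eight center points, a combination for which the two
quadratic columns are orthogonal once each is centered on its mean.
The function names are illustrative.

```python
import itertools
import math

def ccd_points(n_center, alpha=math.sqrt(2.0)):
    """Two-factor CCD: 4 factorial, 4 axial, n_center center points."""
    factorial = [(float(a), float(b))
                 for a, b in itertools.product((-1, 1), repeat=2)]
    axial = [(-alpha, 0.0), (alpha, 0.0), (0.0, -alpha), (0.0, alpha)]
    return factorial + axial + [(0.0, 0.0)] * n_center

def quadratic_cross_product(points):
    """Sum of products of the mean-centered x1^2 and x2^2 columns.

    Zero means the two quadratic terms are estimated independently.
    """
    q1 = [x1 * x1 for x1, _ in points]
    q2 = [x2 * x2 for _, x2 in points]
    m1 = sum(q1) / len(q1)
    m2 = sum(q2) / len(q2)
    return sum((a - m1) * (b - m2) for a, b in zip(q1, q2))

print("8 center points:", round(quadratic_cross_product(ccd_points(8)), 6))
print("4 center points:", round(quadratic_cross_product(ccd_points(4)), 6))
```

With all eight center points the cross product is zero (apart from
rounding); with four of them dropped it is -4/3, so the two quadratic
coefficients are no longer estimated independently of one another.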


Between-Subject vs Within-Subject Designs 


Clark and Williges state that "when noncollapsed designs are
used, the investigator must make another major design decision
with respect to his selected design. If, due to the nature of his
research problem, he chooses to observe different subjects at each
of the experimental points, the resulting study constitutes a
between-subjects design. If, on the other hand, he elects to
observe each subject under all experimental conditions, the
resulting study constitutes a within-subject design. The choice of
a between versus a within-subject design is dictated by the
particular question which the researcher is investigating. In
either case, if the necessary precautions are observed, the
design conforms to the basic central-composite design" (CW305,3).


Of course whether the same or different subjects are used
is a methodological decision that is independent of RSM and CCDs,
and that could be made not only "when noncollapsed designs are
used" but also when collapsed designs are used. Furthermore,
there is a third alternative available to an experimenter concerned
with the serial assignment of experimental conditions to subjects,
which has certain methodological advantages not mentioned in the
Williges series. Thus, different groups (as well as numbers) of
subjects may be used in each block and under the proper conditions
could be used not merely as a means of building up the degrees of
freedom of the error term, but to control and isolate sequence
effects within blocks.

This experimental strategy was illustrated in a study by
Mueller and Simon which is described in a paper by Simon (1970b).
Although Clark and Williges (CW307,3) warn of the importance of
"proper counterbalancing" in within-subjects designs "so as to
avoid spurious sequence effects," except between blocks where it
should not matter when correctly orthogonalized designs are
employed, proper counterbalancing within blocks was neither described
nor employed in the two papers of the series (WN and MW) using
within-subject designs.
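
The difficulty of balancing ten serial positions with six
subjects can be made concrete. The sketch below is an illustration
under stated assumptions, not a reconstruction of either study's
procedure: it builds a simple cyclic Latin square, which balances
serial position only (not carry-over from one condition to the next),
and counts how often each condition occupies each position. The
function names are illustrative.

```python
import math

def cyclic_latin_square(n_conditions):
    """Row r gives the serial order of conditions for subject r."""
    return [[(row + pos) % n_conditions for pos in range(n_conditions)]
            for row in range(n_conditions)]

def position_counts(orders, n_conditions):
    """counts[c][p] = number of subjects receiving condition c at position p."""
    counts = [[0] * n_conditions for _ in range(n_conditions)]
    for order in orders:
        for pos, cond in enumerate(order):
            counts[cond][pos] += 1
    return counts

square = cyclic_latin_square(10)
ten = position_counts(square, 10)        # ten subjects, one per row
six = position_counts(square[:6], 10)    # only six subjects available

print("possible serial orders of ten conditions:", math.factorial(10))
print("ten subjects, every cell filled once:",
      all(c == 1 for row in ten for c in row))
print("six subjects, empty condition-by-position cells:",
      sum(c == 0 for row in six for c in row))
```

Ten subjects fill every condition-by-position cell once. Six subjects
leave 40 of the 100 cells empty, so position effects cannot be
separated from condition effects, and complete counterbalancing over
all 3,628,800 orders is out of the question.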




EVALUATING RESPONSE SURFACE METHODOLOGY

The third major criticism of the series was that the authors
offered no evaluation of the CCD in the context of RSM. Throughout
the series, it is implied that in addition to presenting RSM
and testing certain variations to the CCD, the studies also
represent an empirical evaluation of the usefulness of these tools in
human performance experiments. Thus comments such as the following
quotations are found in the conclusions or the summaries of the
papers in the series:

"The results of this study...indicate that RSM
techniques provide both a useful and economic approach for
investigating the effects of several variables on human transfer
performance." (WB318)

"It is clear from the results that RSM central-composite
design techniques are successful in providing efficient procedures
for generating...prediction equations for
variables important in cartographic symbol location tasks."
(WN335,2)

"The utility of this approach was demonstrated in that it
provided efficient data collection, and the observations obtained
from the response surface equation described complex relationships
among the five parameters investigated." (MW348,2).

"An RSM central composite design provided an efficient method
for obtaining data and quantifying the relationship." (WM349,
summary).


In fact, none of the investigations was designed in a way that
could experimentally evaluate CCDs in the context of RSM.


Two studies in the series were oriented particularly to the
evaluation role. Thus, Williges and Mills (WM349,3) stated that
the purpose of their study "was to investigate the predictive
validity of the RSM regression equation" from a different point of
view than had been employed for the same purpose by the
Williges-North study. Williges and Mills determined how well the estimates
from an equation derived from one set of data correlated with
observed performance values obtained from the same subjects at new
points in the same experimental space. In the Williges-North study,
after the initial data collection effort, a second set of data was
collected at the same coordinates in the experimental space but
with different subjects. They determined how well the estimates
from the equations derived from the original data correlated with
performance obtained in the second effort.


Now the procedure in both studies was essentially to collect
data from sample data points within the experimental space, derive
a multiple regression equation based on those data, and then see
if that equation could estimate a second set of data taken at the
same or equivalent points in the same space.⁸ To reduce this
situation to its least common denominator, imagine that instead
of the points of a CCD, only a single data point had been treated
to the above procedure. Obviously then the retest effort is
merely a measure of reliability (when we make an untested
assumption that the two sets of subjects are homogeneous). The same is
true when retesting is done with the larger number of points of a
CCD or any other design. It is only the reliability of the data
that is being measured along with the experimenter's ability to
eliminate measurement and sampling errors and to control for
unwanted effects that might occur when the data are being collected.
There is no measure of "predictive validity" nor of the
effectiveness of the CCD. Since there was no effort to compare
performance estimates from the equation with performance under real
world operational conditions, no test of the predictive validity
of the equations was made. Since no other configuration of
experimental data collection points was compared with that of the
CCD, no test of the relative effectiveness of CCDs was made. As
stated earlier, regression equations can be derived from any set
of data. Evaluating experimental designs requires a test that
will determine whether sampling the data from the experimental
space according to one pattern will result in a more accurate
representation of the response surface than sampling the data
according to another pattern. There are many other patterns
that might be used in lieu of CCDs and compared for both economy
and efficiency, none of which was ever considered in the Williges
series.

⁸Williges and Mills (MW), for their "cross-validation" test,
collect the second set of data from the other half of the 2^5
factorial, the first half of which had been used in the cube
portion of the original CCD design. They imply that by examining
points interpolated among the original set, they are doing a
different evaluation than Williges and North had done when they
used the same points. But this is not so, if the basic assumption
of the CCD is met, namely, that a second order model will
adequately fit the data. If a second order equation adequately fits the
data, then estimates of all main and two-factor interaction effects,
whether estimated from points for one or the other half of the
2^(5-1) (Resolution V) fractional factorial, should be identical
within the limits of the reliability of the measurements. This is so
by definition. Of course, if that assumption is not met, then the
basic principle of RSM -- to continue collecting data to estimate
higher order effects until the data is fit -- has not been
satisfied. This note does not deny that testing the other half of
the fractional factorial is preferable over repeating the original
half. However there would be no advantage had the experiment
satisfied the RSM principle of data fitting as it is supposed to.
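
The claim in the note on the two halves of the 2^(5-1)
fractional factorial can be checked by construction. The sketch below
assumes the usual Resolution V generator, E = ABCD for one half and
E = -ABCD for the other; the generator actually used by the
investigators is not stated here, and the function names are
illustrative. It verifies that, within either half, the columns for
the five main effects and ten two-factor interactions are mutually
orthogonal, so each half supports separate estimates of the same
fifteen effects.

```python
import itertools

def half_fraction(sign):
    """The 16 runs of a 2^(5-1) design with generator E = sign * ABCD."""
    return [(a, b, c, d, sign * a * b * c * d)
            for a, b, c, d in itertools.product((-1, 1), repeat=4)]

def effect_columns(runs):
    """Contrast columns: 5 main effects and 10 two-factor interactions."""
    cols = {}
    for i in range(5):
        cols[(i,)] = [run[i] for run in runs]
    for i, j in itertools.combinations(range(5), 2):
        cols[(i, j)] = [run[i] * run[j] for run in runs]
    return cols

def all_orthogonal(cols):
    """True if every pair of distinct columns has zero cross product."""
    return all(sum(x * y for x, y in zip(cols[p], cols[q])) == 0
               for p, q in itertools.combinations(list(cols), 2))

first_half = half_fraction(+1)
second_half = half_fraction(-1)
print("runs per half:", len(first_half), len(second_half))
print("halves share no runs:", set(first_half).isdisjoint(second_half))
print("first half orthogonal:", all_orthogonal(effect_columns(first_half)))
print("second half orthogonal:", all_orthogonal(effect_columns(second_half)))
```

If a second-order model holds, the two halves therefore estimate the
same effects, and agreement between them speaks to the reliability of
the measurements, not to any new property of the equation.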

Other investigators have compared the CCD with other data
collection patterns (Box and Hunter, 1958; Brooks, 1955; DeBaun,
1959). However all employ analytic techniques since an evaluation
of this sort cannot properly be made empirically.



EPILOGUE

A paraphrase of a quote from John Gardner (1961) would seem
to be an appropriate way to close:

"The society which scorns excellence in plumbing, because
plumbing is a humble activity, and tolerates shoddiness in
[research] because it is an exalted activity, will have neither
good plumbing nor good [research]. Neither its pipes nor its
theories will hold water." (p. 86).